Abstract

We consider a class of sparsity-inducing regularization terms based on submodular functions. 
While previous work has focused on non-decreasing functions, we explore symmetric submod- 
ular functions and their Lovasz extensions. We show that the Lovasz extension may be seen 
as the convex envelope of a function that depends on level sets (i.e., the set of indices whose 
corresponding components of the underlying predictor are greater than a given constant): this 
leads to a class of convex structured regularization terms that impose prior knowledge on the 
level sets, and not only on the supports of the underlying predictors. We provide a unified set of 
optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level 
sets and recovery conditions). By selecting specific submodular functions, we give a new inter- 
pretation to known norms, such as the total variation; we also define new norms, in particular 
ones that are based on order statistics with application to clustering and outlier detection, and 
on noisy cuts in graphs with application to change point detection in the presence of outliers. 

1 Introduction 

The concept of parsimony is central in many scientific domains. In the context of statistics, signal 
processing or machine learning, it may take several forms. Classically, in a variable or feature 
selection problem, a sparse solution with many zeros is sought so that the model is either more
interpretable, cheaper to use, or simply matches available prior knowledge (see, e.g., [1] [U [3] and 
references therein). In this paper, we instead consider sparsity-inducing regularization terms that 
will lead to solutions with many equal values. A classical example is the total variation in one or two 
dimensions, which leads to piecewise constant solutions [31 [5] and can be applied to various image
labelling problems [51 [S] , or change point detection tasks [3 ISl [3] • Another example is the "Oscar" 
penalty which induces automatic grouping of the features [10]. In this paper, we follow the approach 
of [3], who designed sparsity-inducing norms based on non-decreasing submodular functions, as a
convex approximation to imposing a specific prior on the supports of the predictors. Here, we show 
that a similar parallel holds for another class of submodular functions, namely non-negative
set-functions which are equal to zero for the full and empty set. Our main instance of such functions
are symmetric submodular functions. 

We make the following contributions: 

- We provide in Section 3 explicit links between priors on level sets and certain submodular
functions: we show that the Lovasz extensions (see, e.g., [11] and a short review in Section 2)
associated to these submodular functions are the convex envelopes (i.e., tightest convex lower
bounds) of specific functions that depend on all level sets of the underlying vector.

- In Section 4, we reinterpret existing norms such as the total variation and design new norms,
based on noisy cuts or order statistics. We propose applications to clustering and outlier
detection, as well as to change point detection in the presence of outliers.

- We provide unified algorithms in Section 5, such as proximal operators, which are based on a
sequence of submodular function minimizations (SFMs) when such SFMs are efficient, or on
adapting the generic slower approach of [3] otherwise.

- We derive unified theoretical guarantees for level set recovery in Section 6, showing that even
in the absence of correlation between predictors, level set recovery is not always guaranteed, a
situation which is to be contrasted with traditional support recovery situations [1, 3].

Notation. For $w \in \mathbb{R}^p$ and $q \in [1, \infty]$, we denote by $\|w\|_q$ the $\ell_q$-norm of $w$. Given a subset $A$ of
$V = \{1, \dots, p\}$, $1_A \in \{0,1\}^p$ is the indicator vector of the subset $A$. Moreover, given a vector $w$ and
a matrix $Q$, $w_A$ and $Q_{AA}$ denote the corresponding subvector and submatrix of $w$ and $Q$. Finally,
for $w \in \mathbb{R}^p$ and $A \subseteq V$, $w(A) = \sum_{k \in A} w_k = w^\top 1_A$ (this defines a modular set-function). In this
paper, for a certain vector $w \in \mathbb{R}^p$, we call level sets the sets of indices whose components are larger (or smaller)
than or equal to a certain constant $\alpha$, which we denote $\{w \geq \alpha\}$ (or $\{w \leq \alpha\}$), while we call constant sets
the sets of indices whose components are equal to a constant $\alpha$, which we denote $\{w = \alpha\}$.

2 Review of Submodular Analysis 

In this section, we review relevant results from submodular analysis. For more details, see, e.g., [12],
and, for a review with proofs derived from classical convex analysis, see, e.g., 

Definition. Throughout this paper, we consider a submodular function $F$ defined on the power
set $2^V$ of $V = \{1, \dots, p\}$, i.e., such that $\forall A, B \subseteq V$, $F(A) + F(B) \geq F(A \cup B) + F(A \cap B)$. Unless
otherwise stated, we consider functions which are non-negative (i.e., such that $F(A) \geq 0$ for all
$A \subseteq V$), and that satisfy $F(\varnothing) = F(V) = 0$. Usual examples are symmetric submodular functions,
i.e., such that $\forall A \subseteq V$, $F(V \setminus A) = F(A)$, which are known to always have non-negative values.
We give several examples in Section 4; for illustrating the concepts introduced in this section and
Section 3, we will consider the cut in an undirected chain graph, i.e., $F(A) = \sum_{i=1}^{p-1} |(1_A)_i - (1_A)_{i+1}|$.

Lovasz extension. Given any set-function $F$ such that $F(V) = F(\varnothing) = 0$, one can define its Lovasz
extension $f : \mathbb{R}^p \to \mathbb{R}$, as $f(w) = \int_{\mathbb{R}} F(\{w \geq \alpha\})\, d\alpha$ (see, e.g., [11] for this particular formulation).

The Lovasz extension is convex if and only if $F$ is submodular. Moreover, $f$ is piecewise-linear and
for all $A \subseteq V$, $f(1_A) = F(A)$, that is, it is indeed an extension from $2^V$ (which can be identified with
$\{0, 1\}^p$ through indicator vectors) to $\mathbb{R}^p$. Finally, it is always positively homogeneous. For the chain
graph, we obtain the usual total variation $f(w) = \sum_{i=1}^{p-1} |w_i - w_{i+1}|$.

Base polyhedron. We denote by $B(F) = \{s \in \mathbb{R}^p,\ \forall A \subseteq V,\ s(A) \leq F(A),\ s(V) = F(V)\}$
the base polyhedron [12], where we use the notation $s(A) = \sum_{k \in A} s_k$. An important result in
submodular analysis is that if $F$ is a submodular function, then we have a representation of $f$ as
a maximum of linear functions [12, 11], i.e., for all $w \in \mathbb{R}^p$, $f(w) = \max_{s \in B(F)} w^\top s$. Moreover,
instead of solving a linear program with $2^p$ constraints, a solution $s$ may be obtained by the following
"greedy algorithm": order the components of $w$ in decreasing order $w_{j_1} \geq \cdots \geq w_{j_p}$, and then take
for all $k \in \{1, \dots, p\}$, $s_{j_k} = F(\{j_1, \dots, j_k\}) - F(\{j_1, \dots, j_{k-1}\})$.

Tight and inseparable sets. The polyhedra $U = \{w \in \mathbb{R}^p,\ f(w) \leq 1\}$ and $B(F)$ are polar to each
other (see, e.g., [12] for definitions and properties of polar sets). Therefore, the facial structure of $U$
may be obtained from the one of $B(F)$. Given $s \in B(F)$, a set $A \subseteq V$ is said tight if $s(A) = F(A)$.
It is known that the set of tight sets is a distributive lattice, i.e., if $A$ and $B$ are tight, then so are
$A \cup B$ and $A \cap B$ [HI EI]. The faces of $B(F)$ are thus intersections of hyperplanes $\{s(A) = F(A)\}$
for $A$ belonging to certain distributive lattices (see Prop. 3). A set $A$ is said separable if there exists
a non-trivial partition of $A = B \cup C$ such that $F(A) = F(B) + F(C)$. A set is said inseparable if it
is not separable. For the cut in an undirected graph, inseparable sets are exactly connected sets.
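
As an illustration of the greedy algorithm described in the "Base polyhedron" paragraph, the following Python sketch evaluates the Lovasz extension of a set-function given as a callable, and checks by brute force that the greedy output lies in the base polyhedron. The function names (`chain_cut`, `greedy_base_vertex`, `lovasz_extension`, `in_base_polyhedron`) and the 0-based node indexing are assumptions made for this example; the brute-force check enumerates all subsets and is only meant for small $p$.

```python
from itertools import combinations


def chain_cut(A, p):
    """Cut of the set A in the undirected chain graph on nodes 0, ..., p-1."""
    A = set(A)
    return sum((i in A) != (i + 1 in A) for i in range(p - 1))


def greedy_base_vertex(F, w):
    """Greedy algorithm: sort w in decreasing order and take marginal gains of F.

    For submodular F, the result s maximizes w^T s over the base polyhedron B(F).
    """
    p = len(w)
    order = sorted(range(p), key=lambda k: -w[k])  # indices j_1, ..., j_p
    s = [0.0] * p
    prefix, previous = [], F(())
    for j in order:
        prefix.append(j)
        current = F(tuple(prefix))
        s[j] = current - previous  # F({j_1..j_k}) - F({j_1..j_{k-1}})
        previous = current
    return s


def lovasz_extension(F, w):
    """f(w) = w^T s, with s given by the greedy algorithm."""
    s = greedy_base_vertex(F, w)
    return sum(sk * wk for sk, wk in zip(s, w))


def in_base_polyhedron(F, s, tol=1e-9):
    """Brute-force check of s(A) <= F(A) for all A, and s(V) = F(V)."""
    p = len(s)
    for r in range(p + 1):
        for A in combinations(range(p), r):
            if sum(s[k] for k in A) > F(A) + tol:
                return False
    return abs(sum(s) - F(tuple(range(p)))) <= tol
```

For the chain cut, `lovasz_extension` should coincide with the one-dimensional total variation, and adding a constant to all components of `w` should leave the value unchanged since $F(V) = 0$.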

3 Properties of the Lovasz Extension 

In this section, we derive properties of the Lovasz extension for submodular functions, which go
beyond convexity and homogeneity. Throughout this section, we assume that $F$ is a non-negative
submodular set-function that is equal to zero at $\varnothing$ and $V$. This immediately implies that $f$ is
invariant by addition of any constant vector (that is, $f(w + \alpha 1_V) = f(w)$ for all $w \in \mathbb{R}^p$ and $\alpha \in \mathbb{R}$),
and that $f(1_V) = F(V) = 0$. Thus, contrary to the non-decreasing case [3], our regularizers are not
norms. However, they are norms on the hyperplane $\{w^\top 1_V = 0\}$ as soon as for $A \neq \varnothing$ and $A \neq V$,
$F(A) > 0$ (which we assume for the rest of this paper).

We now show that the Lovasz extension is the convex envelope of a certain combinatorial function
which does depend on all level sets $\{w \geq \alpha\}$ of $w \in \mathbb{R}^p$ (see proof in supplementary material):

Proposition 1 (Convex envelope) The Lovasz extension $f(w)$ is the convex envelope of the function
$w \mapsto \max_{\alpha \in \mathbb{R}} F(\{w \geq \alpha\})$ on the set $[0, 1]^p + \mathbb{R} 1_V = \{w \in \mathbb{R}^p,\ \max_{k \in V} w_k - \min_{k \in V} w_k \leq 1\}$.

Note the difference with the result of [5]: we consider here a different set on which we compute the
convex envelope ($[0, 1]^p + \mathbb{R} 1_V$ instead of $[-1, 1]^p$), and not a function of the support of $w$, but of all
its level sets^1. Moreover, the Lovasz extension is a convex relaxation of a function of level sets (of
the form $\{w \geq \alpha\}$) and not of constant sets (of the form $\{w = \alpha\}$). It would have been perhaps more
intuitive to consider for example $\int_{\mathbb{R}} F(\{w = \alpha\})\, d\alpha$, since it does not depend on the ordering of the
values that $w$ may take; however, the latter function does not lead to a convex function amenable
to polynomial-time algorithms. This definition through level sets will generate some potentially
undesired behavior (such as the well-known staircase effect for the one-dimensional total variation),
as we show in Section 6.

The next proposition describes the set of extreme points of the "unit ball" $U = \{w,\ f(w) \leq 1\}$,
giving a first illustration of sparsity-inducing effects (see example in Figure 1).

Proposition 2 (Extreme points) The extreme points of the set $U \cap \{w^\top 1_V = 0\}$ are the projections
of the vectors $1_A / F(A)$ on the plane $\{w^\top 1_V = 0\}$, for $A$ such that $A$ is inseparable for $F$ and
$V \setminus A$ is inseparable for $B \mapsto F(A \cup B) - F(A)$.

Partially ordered sets and distributive lattices. A subset $\mathcal{D}$ of $2^V$ is a (distributive) lattice
if it is invariant by intersection and union. We assume in this paper that all lattices contain the
empty set $\varnothing$ and the full set $V$, and we endow the lattice with the inclusion order. Such lattices
may be represented as a partially ordered set (poset) $\Pi(\mathcal{D}) = \{A_1, \dots, A_m\}$ (with order relationship
$\succcurlyeq$), where the sets $A_j$, $j = 1, \dots, m$, form a partition of $V$ (we always assume a topological ordering
of the sets, i.e., an indexing compatible with the order $\succcurlyeq$). As illustrated in Figure 2, we go from $\mathcal{D}$ to $\Pi(\mathcal{D})$ by considering
all maximal chains in $\mathcal{D}$ and the differences between consecutive sets. We go from $\Pi(\mathcal{D})$ to $\mathcal{D}$ by
constructing all ideals of $\Pi(\mathcal{D})$, i.e., sets $J$ such that if an element of $\Pi(\mathcal{D})$ is lower than an element

^1 Note that the support $\{w = 0\}$ is a constant set which is the intersection of two level sets.

Figure 1: Top: Polyhedral level set of $f$ (projected on the set $w^\top 1_V = 0$), for 2 different submodular
symmetric functions of three variables, with different inseparable sets leading to different sets of
extreme points; changing values of $F$ may make some of the extreme points disappear. The various
extreme points cut the space into polygons where the ordering of the components is fixed. Left:
$F(A) = 1_{|A| \in \{1,2\}}$ (all possible extreme points); note that the polygon need not be symmetric in
general. Right: one-dimensional total variation on three nodes, i.e., $F(A) = |1_{1 \in A} - 1_{2 \in A}| + |1_{2 \in A} - 1_{3 \in A}|$,
leading to $f(w) = |w_1 - w_2| + |w_2 - w_3|$, for which the extreme points corresponding to the
separable set $\{1, 3\}$ and its complement disappear.

Figure 2: Left: distributive lattice with 7 elements in $2^{\{1,2,3,4,5,6\}}$, represented with the Hasse
diagram corresponding to the inclusion order (for a partial order, a Hasse diagram connects $A$ to $B$
if $A$ is smaller than $B$ and there is no $C$ such that $A$ is smaller than $C$ and $C$ is smaller than $B$). Right:
corresponding poset, with 4 elements that form a partition of $\{1, 2, 3, 4, 5, 6\}$, represented with the
Hasse diagram corresponding to the order $\succcurlyeq$ (a node points to its immediate smaller node according
to $\succcurlyeq$). Note that this corresponds to an "allowed" lattice (see Prop. 3) for the one-dimensional total
variation.

of $J$, then it has to be in $J$ (see [12] for more details, and an example in Figure 2). Distributive
lattices and posets are thus in one-to-one correspondence. Throughout this section, we go back and
forth between these two representations. The distributive lattice will correspond to all authorized
level sets $\{w \geq \alpha\}$ in a single face of $U$, while the elements of the poset are the constant sets (over
which $w$ is constant), with the order between the subsets giving partial constraints between the
values of the corresponding constants.

Faces of $U$. The faces of $U$ are characterized by lattices $\mathcal{D}$, with their corresponding posets
$\Pi(\mathcal{D}) = \{A_1, \dots, A_m\}$. We denote by $U_{\mathcal{D}}^{\circ}$ (and by $U_{\mathcal{D}}$ its closure) the set of $w \in \mathbb{R}^p$ such that
(a) $w$ is piecewise constant with respect to $\Pi(\mathcal{D})$, with value $v_i$ on $A_i$, and (b) for all pairs $(i, j)$,
$A_i \succcurlyeq A_j \Rightarrow v_i > v_j$. For certain lattices $\mathcal{D}$, these will be exactly the relative interiors of all faces
of $U$:

Proposition 3 (Faces of $U$) The (non-empty) relative interiors of all faces of $U$ are exactly of the
form $U_{\mathcal{D}}^{\circ}$, where $\mathcal{D}$ is a lattice such that:

(i) the restriction of $F$ to $\mathcal{D}$ is modular, i.e., for all $A, B \in \mathcal{D}$, $F(A) + F(B) = F(A \cup B) + F(A \cap B)$,

(ii) for all $j \in \{1, \dots, m\}$, the set $A_j$ is inseparable for the function $C_j \mapsto F(B_{j-1} \cup C_j) - F(B_{j-1})$,
where $B_{j-1}$ is the union of all ancestors of $A_j$ in $\Pi(\mathcal{D})$,

(iii) among all lattices corresponding to the same unordered partition, $\mathcal{D}$ is a maximal element of
the set of lattices satisfying (i) and (ii).

Figure 3: Three left plots: Estimation of noisy piecewise constant 1D signal with outliers (indices 5
and 15 in the chain of 20 nodes). Left: original signal. Middle: best estimation with total variation
(level sets are not correctly estimated). Right: best estimation with the robust total variation
based on noisy cut functions (level sets are correctly estimated, with less bias and with detection
of outliers). Right plot: clustering estimation error vs. noise level, in a sequence of 100 variables,
with a single jump, where noise of variance one is added, with 5% of outliers (averaged over 20
replications).

Among the three conditions, the second one is the easiest to interpret, as it reduces to having 
constant sets which are inseparable for certain submodular functions, and for cuts in an undirected 
graph, these will exactly be connected sets. 

Since we are able to characterize all faces of $U$ (of all dimensions) with non-empty relative interior,
we have a partition of the space and any $w \in \mathbb{R}^p$ which is not proportional to $1_V$ will be, up to
the strictly positive constant $f(w)$, in exactly one of these relative interiors of faces; we refer to
this lattice as the lattice associated to $w$. Note that from the face $w$ belongs to, we have strong
constraints on the constant sets, but we may not be able to determine all level sets of $w$, because
only partial constraints are given by the order on $\Pi(\mathcal{D})$. For example, in Figure 2, $w_2$ may be larger
or smaller than $w_5 = w_6$ (and even potentially equal, but with zero probability, see Section 6).



4 Examples of Submodular Functions 

In this section, we provide examples of submodular functions and of their Lovasz extensions. Some 
are well-known (such as cut functions and total variations), some are new in the context of supervised
learning (regular functions), while some have interesting effects in terms of clustering or outlier 
detection (cardinality-based functions). 

Symmetrization. From any submodular function $G$, one may define $F(A) = G(A) + G(V \setminus A) -
G(\varnothing) - G(V)$, which is symmetric. Potentially interesting examples which are beyond the scope of
this paper are mutual information, or functions of eigenvalues of submatrices [3].
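
The symmetrization can be checked numerically on a small ground set. The sketch below is only an illustration: the helper names (`all_subsets`, `symmetrize`, `is_submodular`) and the choice $G(A) = \sqrt{|A|}$ are assumptions made for this example, and submodularity is tested by exhaustive enumeration, which is only feasible for small $p$.

```python
from itertools import combinations
from math import sqrt


def all_subsets(p):
    """All subsets of {0, ..., p-1} as frozensets."""
    return [frozenset(c) for r in range(p + 1) for c in combinations(range(p), r)]


def symmetrize(G, p):
    """Return F(A) = G(A) + G(V \\ A) - G(empty) - G(V)."""
    V = frozenset(range(p))
    empty = frozenset()

    def F(A):
        A = frozenset(A)
        return G(A) + G(V - A) - G(empty) - G(V)

    return F


def is_submodular(F, p, tol=1e-9):
    """Exhaustive check of F(A) + F(B) >= F(A | B) + F(A & B)."""
    subsets = all_subsets(p)
    return all(
        F(A) + F(B) >= F(A | B) + F(A & B) - tol
        for A in subsets
        for B in subsets
    )


# Example: a non-decreasing submodular G (concave function of the cardinality).
p = 4
G = lambda A: sqrt(len(A))
F = symmetrize(G, p)
```

With this choice of $G$, the resulting $F$ should vanish on the empty and full sets, be invariant under complementation, and remain submodular.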

Cut functions. Given a set of nonnegative weights $d : V \times V \to \mathbb{R}_+$, define the cut $F(A) =
\sum_{k \in A,\, j \in V \setminus A} d(k, j)$. The Lovasz extension is equal to $f(w) = \sum_{k, j \in V} d(k, j)(w_k - w_j)_+$ (which
shows submodularity because $f$ is convex), and is often referred to as the total variation. If the
weight function $d$ is symmetric, then the submodular function is also symmetric. In this case, it
can be shown that inseparable sets for functions $A \mapsto F(A \cup B) - F(B)$ are exactly connected sets.
Hence, constant sets are connected sets, which is the usual justification behind the total variation.
Note however that some configurations of connected sets are not allowed due to the other conditions
in Prop. 3 (see examples in Section 6). In Figure 5 (right plot), we give an example of the usual
chain graph, leading to the one-dimensional total variation [U [S] . Note that these functions can be
extended to cuts in hypergraphs, which may have interesting applications in computer vision [6].
Moreover, directed cuts may be interesting to favor increasing or decreasing jumps along the edges
of the graph.

Regular functions and robust total variation. By partial minimization, we obtain so-called
regular functions [SI [5] . One application is "noisy cut functions": for a given weight function
$d : W \times W \to \mathbb{R}_+$, where each node in $W$ is uniquely associated to a node in $V$, we consider the
submodular function obtained as the minimum cut adapted to $A$ in the augmented graph (see right
plot of Figure 5): $F(A) = \min_{B \subseteq W} \sum_{k \in B,\, j \in W \setminus B} d(k, j) + \lambda |A \Delta B|$. This allows for robust versions
of cuts, where some gaps may be tolerated. See examples in Figure 3, illustrating the behavior of the
type of graph displayed in the bottom-right plot of Figure 5, where the performance of the robust
total variation is significantly more stable in presence of outliers.
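
To make the cut and noisy cut functions concrete, here is a small Python sketch. It assumes that $W$ is identified with $V$ (nodes 0 to p-1), stores the weights $d$ in a dictionary keyed by ordered pairs, and evaluates the noisy cut by enumerating all subsets $B$; the names `cut`, `total_variation`, `noisy_cut` and `chain_weights` are hypothetical. A practical implementation would rely on a min-cut/max-flow solver rather than enumeration.

```python
from itertools import combinations


def chain_weights(p):
    """Symmetric unit weights on the chain graph 0 - 1 - ... - (p-1)."""
    d = {}
    for k in range(p - 1):
        d[(k, k + 1)] = 1.0
        d[(k + 1, k)] = 1.0
    return d


def cut(d, A):
    """F(A) = sum of d(k, j) over k in A and j outside A."""
    A = set(A)
    return sum(wt for (k, j), wt in d.items() if k in A and j not in A)


def total_variation(d, w):
    """Lovasz extension of the cut: sum of d(k, j) * (w_k - w_j)_+."""
    return sum(wt * max(w[k] - w[j], 0.0) for (k, j), wt in d.items())


def noisy_cut(d, A, nodes, lam):
    """min over B of cut(B) + lam * |A symmetric-difference B| (brute force)."""
    A = set(A)
    best = float("inf")
    for r in range(len(nodes) + 1):
        for B in combinations(nodes, r):
            value = cut(d, B) + lam * len(A.symmetric_difference(B))
            best = min(best, value)
    return best
```

On a chain of six nodes, the set {0, 1, 3} has a plain cut of 3, while the noisy cut may "fill the gap" at node 2 or drop node 3 at a cost of `lam`, which is the tolerance to isolated outliers mentioned above.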

Cardinality-based functions. For $F(A) = h(|A|)$, where $h$ is such that $h(0) = h(p) = 0$ and
$h$ concave, we obtain a submodular function, and a Lovasz extension that depends on the order
statistics of $w$, i.e., if $w_{j_1} \geq \cdots \geq w_{j_p}$, then $f(w) = \sum_{k=1}^{p-1} h(k)(w_{j_k} - w_{j_{k+1}})$. While these examples
do not provide significantly different behaviors for the non-decreasing submodular functions explored
by [3] (i.e., in terms of support), they lead to interesting behaviors here in terms of level sets, i.e.,
they will make the components of $w$ cluster together in specific ways. Indeed, as shown in Section 3,
allowed constant sets $A$ are such that $A$ is inseparable for the function $C \mapsto h(|B \cup C|) - h(|B|)$
(where $B \subseteq V$ is the set of components with higher values than the ones in $A$), which imposes that
the concave function $h$ is not linear on $[|B|, |B| + |A|]$. We consider the following examples:

1. $F(A) = |A| \cdot |V \setminus A|$, leading to $f(w) = \sum_{i < j} |w_i - w_j|$. This function can thus be also seen
as the cut in the fully connected graph. All patterns of level sets are allowed as the function $h$
is strongly concave (see left plot of Figure 4). This function has been extended in [14] by
considering situations where each $w_j$ is a vector, instead of a scalar, and replacing the absolute
value $|w_i - w_j|$ by any norm $\|w_i - w_j\|$, leading to convex formulations for clustering.

2. $F(A) = 1$ if $A \neq \varnothing$ and $A \neq V$, and $0$ otherwise, leading to $f(w) = \max_{i,j} |w_i - w_j|$. Two large
level sets at the top and bottom, all the rest of the variables are in-between and separated
(Figure 4, second plot from the left).

3. $F(A) = \min\{|A|, |V \setminus A|\}$. This function is piecewise affine, with only one kink, thus only
one level set of cardinality greater than one (in the middle) is possible, which is observed in
Figure 4 (third plot from the left). This may have applications to multivariate outlier detection
by considering extensions similar to [14].
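
The order-statistics formula for cardinality-based functions is easy to evaluate by sorting. The following sketch is an illustration under stated assumptions: the function names are hypothetical, and the third example is taken to be $h(k) = \min\{k, p - k\}$, which is the piecewise affine choice with a single kink that satisfies $h(0) = h(p) = 0$ and concavity.

```python
def lovasz_cardinality(h, w):
    """f(w) = sum_{k=1}^{p-1} h(k) * (w_(k) - w_(k+1)), with w sorted decreasingly."""
    p = len(w)
    ws = sorted(w, reverse=True)
    return sum(h(k) * (ws[k - 1] - ws[k]) for k in range(1, p))


def h_quadratic(p):
    """h(k) = k (p - k), i.e., F(A) = |A| |V \\ A|."""
    return lambda k: k * (p - k)


def h_constant(p):
    """h(k) = 1 for 0 < k < p and 0 otherwise."""
    return lambda k: 1 if 0 < k < p else 0


def h_one_kink(p):
    """h(k) = min(k, p - k): piecewise affine with a single kink (assumed form)."""
    return lambda k: min(k, p - k)


def pairwise_abs_sum(w):
    """Reference value sum_{i<j} |w_i - w_j| for the first example."""
    p = len(w)
    return sum(abs(w[i] - w[j]) for i in range(p) for j in range(i + 1, p))
```

For instance, with `w = [3, 1, 4, 1, 5]`, the first example should agree with `pairwise_abs_sum(w)` and the second with `max(w) - min(w)`.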



5 Optimization Algorithms 

In this section, we present optimization methods for minimizing convex objective functions regular- 
ized by the Lovasz extension of a submodular function. These lead to convex optimization problems, 
which we tackle using proximal methods (see, e.g., [15]). We first start by mentioning that subgradients
may easily be derived (but subgradient descent is here rather inefficient as shown in Figure 5).
Moreover, note that with the square loss, the regularization paths are piecewise affine, as a direct
consequence of regularizing by a polyhedral function.

Subgradient. From $f(w) = \max_{s \in B(F)} s^\top w$ and the greedy algorithm^2 presented in Section 2,
one can easily get in polynomial time one subgradient as one of the maximizers $s$. This allows the
use of subgradient descent, with slow convergence compared to proximal methods (see Figure 5).
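
A minimal sketch of subgradient descent built on the greedy algorithm is given below, for the denoising objective $\frac{1}{2}\|w - z\|_2^2 + \lambda f(w)$. The function names, the $1/t$ step size and the tracking of the best iterate are choices made for this illustration, not a description of the experiments reported in the paper.

```python
def greedy_subgradient(F, w):
    """One subgradient of the Lovasz extension at w, via the greedy algorithm."""
    p = len(w)
    order = sorted(range(p), key=lambda k: -w[k])
    s = [0.0] * p
    prefix, previous = [], F(())
    for j in order:
        prefix.append(j)
        current = F(tuple(prefix))
        s[j] = current - previous
        previous = current
    return s


def objective(F, w, z, lam):
    """0.5 * ||w - z||^2 + lam * f(w), with f(w) = w^T s for the greedy s."""
    s = greedy_subgradient(F, w)
    f_w = sum(sk * wk for sk, wk in zip(s, w))
    return 0.5 * sum((wk - zk) ** 2 for wk, zk in zip(w, z)) + lam * f_w


def subgradient_descent(F, z, lam, n_iter=200):
    """Subgradient descent with step 1/t; returns the best iterate found."""
    w = list(z)
    best_w, best_val = list(w), objective(F, w, z, lam)
    for t in range(1, n_iter + 1):
        s = greedy_subgradient(F, w)
        g = [(wk - zk) + lam * sk for wk, zk, sk in zip(w, z, s)]
        w = [wk - gk / t for wk, gk in zip(w, g)]
        val = objective(F, w, z, lam)
        if val < best_val:
            best_w, best_val = list(w), val
    return best_w, best_val


def chain_cut(A, p):
    """Cut in the chain graph on nodes 0, ..., p-1 (used as an example of F)."""
    A = set(A)
    return sum((i in A) != (i + 1 in A) for i in range(p - 1))
```

Since the iterates of subgradient descent do not decrease the objective monotonically, the sketch returns the best value seen so far rather than the last iterate.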

Proximal problems through sequences of submodular function minimizations (SFMs).

Given regularized problems of the form $\min_{w \in \mathbb{R}^p} L(w) + \lambda f(w)$, where $L$ is differentiable with

^2 The greedy algorithm to find extreme points of the base polyhedron should not be confused with the greedy
algorithm (e.g., forward selection) that is common in supervised learning/statistics.

Figure 4: Left: Piecewise linear regularization paths of proximal problems (Eq. (1)) for different
functions of cardinality. From left to right: quadratic function (all level sets allowed), second example
in Section 4 (two large level sets at the top and bottom), piecewise linear with two pieces (a single
large level set in the middle). Right: Same plot for the one-dimensional total variation. Note that
in both cases the regularization paths for orthogonal designs are agglomerative (see Section 5), while
for general designs, they would still be piecewise affine but not agglomerative.

Lipschitz-continuous gradient, proximal methods have been shown to be particularly efficient first-order
methods (see, e.g., [H]). In this paper, we use the method "ISTA" and its accelerated variant
"FISTA" [15]. To apply these methods, it suffices to be able to solve efficiently:

$$\min_{w \in \mathbb{R}^p} \frac{1}{2}\|w - z\|_2^2 + \lambda f(w), \qquad (1)$$

which we refer to as the proximal problem. It is known that solving the proximal problem is related to
submodular function minimization (SFM). More precisely, the minimum of $A \mapsto \lambda F(A) - z(A)$ may
be obtained by selecting negative components of the solution of a single proximal problem [12, 11].
Alternatively, the solution of the proximal problem may be obtained by a sequence of at most $p$
submodular function minimizations of the form $A \mapsto \lambda F(A) - z(A)$, by a decomposition algorithm
adapted from [16], and described in [11].

Thus, computing the proximal operator has polynomial complexity since SFM has polynomial complexity.
However, it may be too slow for practical purposes, as the best generic algorithm has
complexity $O(p^6)$ [17]^3. Nevertheless, this strategy is efficient for families of submodular functions
for which dedicated fast algorithms exist:

- Cuts: Minimizing the cut or the partially minimized cut, plus a modular function, may be
done with a min-cut/max-flow algorithm (see, e.g., [51 [S]). For proximal methods, we need in
fact to solve an instance of a parametric max-flow problem, which may be done using other
efficient dedicated algorithms [19, 5] than the decomposition algorithm derived from [16].

- Functions of cardinality: minimizing functions of the form $A \mapsto \lambda F(A) - z(A)$ can be done
in closed form by sorting the elements of $z$.
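
The sorting argument for functions of cardinality can be sketched as follows: for a fixed cardinality $k$, the term $-z(A)$ is minimized by taking the $k$ largest components of $z$, so it suffices to scan the $p + 1$ prefixes of the sorted order. The function names below are hypothetical, and the brute-force routine is included only as a reference for small $p$.

```python
from itertools import combinations


def minimize_cardinality_sfm(h, z, lam):
    """Minimize lam * h(|A|) - z(A) over subsets A by sorting z."""
    p = len(z)
    order = sorted(range(p), key=lambda k: -z[k])  # decreasing values of z
    best_val, best_k = lam * h(0), 0
    running_sum = 0.0
    for k in range(1, p + 1):
        running_sum += z[order[k - 1]]      # sum of the k largest entries of z
        val = lam * h(k) - running_sum
        if val < best_val:
            best_val, best_k = val, k
    return set(order[:best_k]), best_val


def minimize_brute_force(h, z, lam):
    """Reference minimum of lam * h(|A|) - z(A) by enumerating all subsets."""
    p = len(z)
    return min(
        lam * h(len(A)) - sum(z[k] for k in A)
        for r in range(p + 1)
        for A in combinations(range(p), r)
    )
```

The cost is dominated by the sort, i.e., $O(p \log p)$ operations, compared with the $2^p$ subsets visited by the brute-force reference.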

Proximal problems through minimum-norm-point algorithm. In the generic case (i.e.,
beyond cuts and cardinality-based functions), we can follow [5]: since $f(w)$ is expressed as a maximum
of linear functions, the problem reduces to the projection on the polytope $B(F)$, for which we happen
to be able to easily maximize linear functions (using the greedy algorithm described in Section 2).
This can be tackled efficiently by the minimum-norm-point algorithm [12], which iterates between
orthogonal projections on affine subspaces and the greedy algorithm for the submodular function^4.
We compare all optimization methods on synthetic examples in Figure 5.

^3 Note that even in the case of symmetric submodular functions, where more efficient algorithms in $O(p^3)$ for
submodular function minimization (SFM) exist [18], the minimization of functions of the form $\lambda F(A) - z(A)$ is
provably as hard as general SFM [18].

^4 Interestingly, when used for submodular function minimization (SFM), the minimum-norm-point algorithm has
no complexity bound but is empirically faster than algorithms with such bounds [12].

Figure 5: Left: Matlab running times of different optimization methods on 20 replications of a least-squares
regression problem with $p = 1000$ for a cardinality-based submodular function (best seen in
color). Proximal methods with the generic algorithm (using the minimum-norm-point algorithm)
are faster than subgradient descent (with two schedules for the learning rate, $1/t$ or $1/\sqrt{t}$). Using the
dedicated algorithm (which is not available in all situations) is significantly faster. Right: Examples
of graphs (top: chain graph, bottom: hidden chain graph, with sets $W$ and $V$ and examples of a set
$A$ in light red, and $B$ in blue, see text for details).

Proximal path as agglomerative clustering. When $\lambda$ varies from zero to $+\infty$, then the unique
optimal solution of Eq. (1) goes from $z$ to a constant. We now provide conditions under which
the regularization path of the proximal problem may be obtained by agglomerative clustering (see
examples in Figure 4):

Proposition 4 (Agglomerative clustering) Assume that for all sets $A, B$ such that $B \cap A = \varnothing$
and $A$ is inseparable for $D \mapsto F(B \cup D) - F(B)$, we have:

$$\forall C \subseteq A,\quad \frac{|C|}{|A|}\,[F(B \cup A) - F(B)] \leq F(B \cup C) - F(B). \qquad (2)$$

Then the regularization path for Eq. (1) is agglomerative, that is, if two variables are in the same
constant set for a certain $\mu \in \mathbb{R}_+$, so are they for all larger $\lambda \geq \mu$.

As shown in the supplementary material, the assumptions required by Prop. 4 are satisfied by
(a) all submodular set-functions that only depend on the cardinality, and (b) the one-dimensional
total variation; we thus recover and extend known results from [71 [20l [M].

Adding an $\ell_1$-norm. Following [4], we may add the $\ell_1$-norm $\|w\|_1$ for additional sparsity of $w$ (on
top of shaping its level sets). The following proposition extends the result for the one-dimensional
total variation [H [3T] to all submodular functions and their Lovasz extensions:

Proposition 5 (Proximal problem for $\ell_1$-penalized problems) The unique minimizer of
$\frac{1}{2}\|w - z\|_2^2 + f(w) + \lambda\|w\|_1$ may be obtained by soft-thresholding the minimizers of $\frac{1}{2}\|w - z\|_2^2 + f(w)$. That
is, the proximal operator for $f + \lambda\|\cdot\|_1$ is equal to the composition of the proximal operator for $f$
and the one for $\lambda\|\cdot\|_1$.
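
The composition property can be illustrated on the smallest non-trivial case, the total variation on two nodes, $f(w) = |w_1 - w_2|$. The closed form used in `prox_tv2` below is derived here for this special case only (write $w$ in terms of its mean and its difference; the mean is kept and the difference is soft-thresholded by 2) and is an assumption of this sketch, as are all function names.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |.| applied to the scalar x."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def prox_tv2(z):
    """Minimizer of 0.5 * ||w - z||^2 + |w_1 - w_2| for a vector z of length 2."""
    mean = 0.5 * (z[0] + z[1])
    diff = soft_threshold(z[0] - z[1], 2.0)
    return [mean + 0.5 * diff, mean - 0.5 * diff]


def prox_f_plus_l1(z, lam, prox_f):
    """Composition: first the prox of f, then soft-thresholding by lam."""
    return [soft_threshold(wk, lam) for wk in prox_f(z)]


def objective(w, z, lam):
    """0.5 * ||w - z||^2 + |w_1 - w_2| + lam * ||w||_1 (two-node example)."""
    quad = 0.5 * sum((wk - zk) ** 2 for wk, zk in zip(w, z))
    return quad + abs(w[0] - w[1]) + lam * (abs(w[0]) + abs(w[1]))
```

For `z = [2.0, -0.5]` and `lam = 0.25`, the composed operator can be compared against a grid search over the full objective, which should not find a point with a smaller objective value.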



6 Sparsity-inducing Properties 

Going from the penalization of supports to the penalization of level sets introduces some complexity 
and for simplicity in this section, we only consider the analysis in the context of orthogonal design 
matrices, which is often referred to as the denoising problem, and in the context of level set estimation 
already leads to interesting results. That is, we study the global minimum of the proximal problem 
in Eq. (1) and make some assumption regarding $z$ (typically $z = w^* +$ noise), and provide guarantees
related to the recovery of the level sets of $w^*$. We first start by characterizing the allowed level sets,
showing that the partial constraints defined in Section 3 on faces of $\{f(w) \leq 1\}$ do not create by
chance further groupings of variables (see proof in supplementary material).

Proposition 6 (Stable constant sets) Assume $z \in \mathbb{R}^p$ has an absolutely continuous density with
respect to the Lebesgue measure. Then, with probability one, the unique minimizer $w$ of Eq. (1) has
constant sets that define a partition corresponding to a lattice $\mathcal{D}$ defined in Prop. 3.

We now show that under certain conditions the recovered constant sets are the correct ones: 

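Theorem 1 (Level set recovery) Assume that $z = w^* + \sigma\varepsilon$, where $\varepsilon \in \mathbb{R}^p$ is a standard Gaussian
random vector, and $w^*$ is consistent with the lattice $\mathcal{D}$ and its associated poset $\Pi(\mathcal{D}) = (A_1, \dots, A_m)$,
with values $v_j^*$ on $A_j$, for $j \in \{1, \dots, m\}$. Denote $B_j = A_1 \cup \cdots \cup A_j$ for $j \in \{1, \dots, m\}$. Assume
that there exist some constants $\eta_j > 0$ and $\nu > 0$ such that:

$$\forall C_j \subseteq A_j,\quad F(B_{j-1} \cup C_j) - F(B_{j-1}) - \frac{|C_j|}{|A_j|}\,[F(B_{j-1} \cup A_j) - F(B_{j-1})] \geq \eta_j \min\Big\{\frac{|C_j|}{|A_j|},\ 1 - \frac{|C_j|}{|A_j|}\Big\}, \qquad (3)$$

$$\forall i, j \in \{1, \dots, m\},\quad A_i \succcurlyeq A_j \Rightarrow v_i^* - v_j^* \geq \nu, \qquad (4)$$

$$\forall j \in \{1, \dots, m\},\quad \lambda\,\frac{|F(B_j) - F(B_{j-1})|}{|A_j|} \leq \frac{\nu}{4}. \qquad (5)$$

Then the unique minimizer $w$ of Eq. (1) is associated to the same lattice $\mathcal{D}$ as $w^*$, with probability
greater than $1 - \sum_{j=1}^m \exp\Big(-\frac{\nu^2 |A_j|}{32\sigma^2}\Big) - 2\sum_{j=1}^m |A_j| \exp\Big(-\frac{\lambda^2 \eta_j^2}{2\sigma^2 |A_j|^2}\Big)$.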

We now discuss the three main assumptions of Theorem 1 as well as the probability estimate:

- Eq. (3) is the equivalent of the support recovery condition of the Lasso [1] or its extensions [3]. The main
difference is that for support recovery, this assumption is always met for orthogonal designs,
while here it is not always met. Interestingly, the validity of level set recovery implies the
agglomerativity of proximal paths (Eq. (2) in Prop. 4).

Note that if Eq. (3) is satisfied only with $\eta_j \geq 0$ (it is then exactly Eq. (2) in Prop. 4), then,
even with infinitesimal noise, one can show that in some cases, the wrong level sets may be
obtained with non-vanishing probability, while if $\eta_j$ is strictly negative, one can show that
in some cases, we never get the correct level sets. Eq. (3) is thus essentially sufficient and
necessary.

- Eq. (4) corresponds to having distinct values of $w^*$ far enough from each other.

- Eq. (5) is a constraint on $\lambda$ which controls the bias of the estimator: if it is too large, then
there may be a merging of two clusters.

- In the probability estimate, the second term is small if all $\sigma^2 |A_j|^{-1}$ are small enough (i.e.,
given the noise, there is enough data to correctly estimate the values of the constant sets) and
the third term is small if $\lambda$ is large enough, to avoid that clusters split.

One-dimensional total variation. In this situation, we always get $\eta_j \geq 0$, but in some cases, it
cannot be improved (i.e., the best possible $\eta_j$ is equal to zero), and as shown in the supplementary
material, this occurs as soon as there is a "staircase", i.e., a piecewise constant vector, with a
sequence of at least two consecutive increases, or two consecutive decreases, showing that in the
presence of such staircases, one cannot have consistent support recovery, which is a well-known issue
in signal processing (typically, more steps are created). If there is no staircase effect, we have $\eta_j = 1$
and Eq. (5) becomes $\lambda \leq \frac{\nu}{8} \min_j |A_j|$. If we take $\lambda$ equal to the limiting value in Eq. (5), then we
obtain a probability greater than $1 - 4p \exp\Big(-\frac{\nu^2 \min_j |A_j|^2}{128\sigma^2 \max_j |A_j|^2}\Big)$. Note that we could also derive general
results when an additional $\ell_1$-penalty is used, thus extending results from

Two-dimensional total variation. In this situation, even with only two different values for
$w^*$, then we may have $\eta_j < 0$, leading to additional problems, which has already been noticed in
continuous settings (see, e.g., [23] and the supplementary material).

Clustering with $F(A) = |A| \cdot |V \setminus A|$. In this case, we have $\eta_j = |A_j|/2$, and Eq. (5) becomes
A ^ leading to the probability of correct support estimation greater than 1 — 4pexp ( — -y^^)- 
This indicates that the noise variance $\sigma^2$ should be small compared to which is not satisfactory
and would be corrected with the weighting schemes proposed in [14].

7 Conclusion 

We have presented a family of sparsity-inducing norms dedicated to incorporating prior knowledge or 
structural constraints on the level sets of linear predictors. We have provided a set of common algo- 
rithms and theoretical results, as well as simulations on synthetic examples illustrating the behavior 
of these norms. Several avenues are worth investigating: first, we could follow current practice in 
sparse methods, e.g., by considering related adapted concave penalties to enhance sparsity-inducing 
capabilities, or by extending some of the concepts for norms of matrices, with potential applications 
in matrix factorization [24] or multi-task learning [25].

A Proof of Proposition 1



Proof For any $w \in \mathbb{R}^p$, level sets of $w$ are characterized by an ordered partition
$(A_1, \dots, A_m)$ so that $w$ is constant on each $A_j$, with value $t_j$, $j = 1, \dots, m$, and so
that $(t_j)$ is a strictly decreasing sequence. We can now decompose minimization with respect to $w$
using these ordered partitions and $(t_j)$.

In order to compute the convex envelope, we simply need to compute twice the Fenchel conjugate 
of the function we want to find the envelope of (see, e.g., [57] for definitions and properties of 
Fenchel conjugates). 

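Let $s \in \mathbb{R}^p$; we consider the function $g : w \mapsto \max_{\alpha \in \mathbb{R}} F(\{w \geqslant \alpha\})$,
and we compute its Fenchel conjugate:

\begin{align*}
g^\ast(s) &\overset{\mathrm{def}}{=} \max_{w \in [0,1]^p + \mathbb{R} 1_V} \; w^\top s - g(w), \\
&= \max_{(A_1,\dots,A_m)\ \text{partition}} \Big\{ \max_{t_1 > \cdots > t_m,\ t_1 - t_m \leqslant 1} \sum_{j=1}^m t_j s(A_j) - \max_{j \in \{1,\dots,m\}} F(A_1 \cup \cdots \cup A_j) \Big\}, \\
&= \max_{(A_1,\dots,A_m)\ \text{partition}} \Big\{ \max_{t_1 > \cdots > t_m,\ t_1 - t_m \leqslant 1} \sum_{j=1}^{m-1} (t_j - t_{j+1}) s(A_1 \cup \cdots \cup A_j) + t_m s(V) \\
&\qquad\qquad - \max_{j \in \{1,\dots,m\}} F(A_1 \cup \cdots \cup A_j) \Big\} \quad \text{by integration by parts}, \\
&= \iota_{s(V)=0}(s) + \max_{(A_1,\dots,A_m)\ \text{partition}} \Big\{ \max_{i \in \{1,\dots,m-1\}} s(A_1 \cup \cdots \cup A_i) - \max_{j \in \{1,\dots,m\}} F(A_1 \cup \cdots \cup A_j) \Big\}, \\
&= \iota_{s(V)=0}(s) + \max_{(A_1,\dots,A_m)\ \text{partition}} \Big\{ \max_{i \in \{1,\dots,m-1\}} s(A_1 \cup \cdots \cup A_i) - \max_{j \in \{1,\dots,m-1\}} F(A_1 \cup \cdots \cup A_j) \Big\},
\end{align*}

where $\iota_{s(V)=0}$ is the indicator function of the set $\{s(V) = 0\}$ (with values $0$ or $+\infty$). Note that
$\max_{j \in \{1,\dots,m\}} F(A_1 \cup \cdots \cup A_j) = \max_{j \in \{1,\dots,m-1\}} F(A_1 \cup \cdots \cup A_j)$ because $F(V) = 0$.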

Let $h(s) = \iota_{s(V)=0}(s) + \max_{A \subset V} \{ s(A) - F(A) \}$. We clearly have $g^\ast(s) \geqslant h(s)$, because we
take a maximum over a larger set (consider $m = 2$). Moreover, for all partitions $(A_1, \dots, A_m)$,
if $s(V) = 0$, $\max_{j \in \{1,\dots,m-1\}} s(A_1 \cup \cdots \cup A_j) \leqslant \max_{j \in \{1,\dots,m-1\}} \big( h(s) + F(A_1 \cup \cdots \cup A_j) \big) =
h(s) + \max_{j \in \{1,\dots,m-1\}} F(A_1 \cup \cdots \cup A_j)$, which implies that $g^\ast(s) \leqslant h(s)$. Thus $g^\ast(s) = h(s)$.

Moreover, we have, since $f$ is invariant by adding constants and $F$ is submodular,

\begin{align*}
\max_{w \in [0,1]^p + \mathbb{R} 1_V} \; w^\top s - f(w) &= \iota_{s(V)=0}(s) + \max_{w \in [0,1]^p} \{ w^\top s - f(w) \} \\
&= \iota_{s(V)=0}(s) + \max_{A \subset V} \{ s(A) - F(A) \} = h(s),
\end{align*}

where we have used the fact that minimizing a submodular function is equivalent to minimizing its
Lovász extension on the unit hypercube. Thus $f$ and $g$ have the same Fenchel conjugates. The result
follows from the convexity of $f$, using the fact that the convex envelope is the Fenchel bi-conjugate [26, 27].
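
As an illustrative numerical check (not part of the original proof), the sketch below takes the cut function of a chain graph as an assumed example of a symmetric submodular function with $F(\emptyset) = F(V) = 0$. It evaluates the Lovász extension $f$ by the usual sorting formula and the level-set function $g(w) = \max_\alpha F(\{w \geqslant \alpha\})$, and compares the two on points of $[0,1]^p$. The helper names `chain_cut`, `lovasz_extension` and `level_set_max` are hypothetical.

```python
import random

def chain_cut(A, p):
    """Cut function of the chain 0-1-...-(p-1): number of edges leaving A."""
    A = set(A)
    return sum((i in A) != (i + 1 in A) for i in range(p - 1))

def lovasz_extension(w, F):
    """f(w) = sum_k w[j_k] * (F({j_1..j_k}) - F({j_1..j_{k-1}})), w sorted decreasingly."""
    order = sorted(range(len(w)), key=lambda k: -w[k])
    value, previous, prefix = 0.0, 0, []
    for k in order:
        prefix.append(k)
        current = F(prefix)
        value += w[k] * (current - previous)
        previous = current
    return value

def level_set_max(w, F):
    """g(w) = max over alpha of F({w >= alpha}); thresholds at the values of w suffice
    here because F is non-negative and F of the empty set is zero."""
    return max(F([k for k in range(len(w)) if w[k] >= a]) for a in set(w))

p = 6                                   # assumed small example size
F = lambda A: chain_cut(A, p)
rng = random.Random(0)
samples = [[rng.random() for _ in range(p)] for _ in range(200)]

# g - f should be non-negative on [0,1]^p if f is a convex lower bound of g there.
gaps = [level_set_max(w, F) - lovasz_extension(w, F) for w in samples]

# For the chain cut, the Lovasz extension is the one-dimensional total variation.
tv_errors = [abs(lovasz_extension(w, F) - sum(abs(w[i] - w[i + 1]) for i in range(p - 1)))
             for w in samples]

# On an indicator vector, f and g coincide with F of the corresponding set.
indicator = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
f_ind, g_ind = lovasz_extension(indicator, F), level_set_max(indicator, F)
```

The check only probes the inequality $f \leqslant g$ and equality at indicator vectors; it does not by itself establish that $f$ is the tightest convex lower bound.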

B Proof of Proposition 2



Proof Extreme points of U correspond to full-dimensional faces of B{F). From Corollary 3.4.4 in 
[T^ . these facets are exactly the ones that correspond to sets A with the given conditions. These 
facets are defined as the intersection of $\{s(A) = F(A)\}$ and $\{s(V) = F(V)\}$, which leads to the
desired result. Note that this is also a consequence of Prop. 3. Note that when $F$ is symmetric, the
second condition is equivalent to $V \setminus A$ being inseparable for $F$. ■



C Proof of Proposition 3

Proof Given that the polyhedra $U$ and $B(F)$ are polar to each other [13], the proposition follows
from Theorem 3.43 in [12], where each of our three assumptions is equivalent to a corresponding
one in Theorem 3.43 from [12]. ■



D Proof of Proposition 4

We first start by a lemma, which follows common practice in sparse recovery (assume a certain 
sparsity pattern and check when it is actually optimal): 

Lemma 1 (Optimality of lattice for proximal problem) The solution of the proximal problem
in Eq. (1) corresponds to a lattice $\mathcal{D}$ if and only if $v = (M^\top M)^{-1}(M^\top z - \lambda t)$ satisfies the order
relationships imposed by $\mathcal{D}$ and

$$ \frac{1}{\lambda} \big( I - M(M^\top M)^{-1} M^\top \big) z + M(M^\top M)^{-1} t \in B(F), $$

where $M \in \mathbb{R}^{p \times m}$ is the indicator matrix of the partition $\Pi(\mathcal{D})$, and
$t_i = F(A_1 \cup \cdots \cup A_i) - F(A_1 \cup \cdots \cup A_{i-1})$, $i = 1, \dots, m$.

Proof Any $w \in \mathbb{R}^p$ belongs to the relative interior of a single face from Prop. 3, defined by a
lattice $\mathcal{D}$, i.e., $w$ is constant on $A_i$ with value $v_i$ (which implies that $w = Mv$) and such that
$v_i > v_j$ as soon as $A_i \succcurlyeq A_j$. We assume a topological ordering of the sets $A_i$, i.e.,
$A_i \succcurlyeq A_j \Rightarrow i \leqslant j$. Since the Lovász extension is linear for $w$ in the face associated with
$\mathcal{D}$ (and equal to $t^\top v$ for $w = Mv$), the optimum over $w$ can be found by minimizing with respect to $v$

$$ \frac{1}{2} \| z - Mv \|_2^2 + \lambda t^\top v. $$

We thus get, by setting the gradient to zero:

$$ v = (M^\top M)^{-1} (M^\top z - \lambda t). $$

Optimality conditions for $w$ for Eq. (1) are that $w - z + \lambda s = 0$, for $s \in B(F)$ and $f(w) = w^\top s$ (these
are obtained from general optimality conditions for functions defined as pointwise maxima [27]).
Thus our candidate $w = Mv$ is optimal if and only if $Mv - z + \lambda s = w - z + \lambda s = 0$ for (a) $s \in B(F)$
and (b) $f(w) = w^\top s$. From Prop. 10 in [11], for (b) to be valid, $s \in B(F)$ simply has to satisfy
$s(A_1 \cup \cdots \cup A_i) = F(A_1 \cup \cdots \cup A_i)$ for all $i$.

Note that

$$ z - Mv = \big( I - M(M^\top M)^{-1} M^\top \big) z + \lambda M(M^\top M)^{-1} t, $$

and that for all $i \in \{1, \dots, m\}$,

$$ 1_{A_i}^\top \big( I - M(M^\top M)^{-1} M^\top \big) z = \delta_i^\top M^\top \big( I - M(M^\top M)^{-1} M^\top \big) z = 0, $$

where $\delta_i$ is the indicator vector of the singleton $\{i\}$. Moreover, we have

$$ 1_{A_i}^\top M(M^\top M)^{-1} t = t_i = F(A_1 \cup \cdots \cup A_i) - F(A_1 \cup \cdots \cup A_{i-1}), $$

so that, if $B_i = A_1 \cup \cdots \cup A_i$, $[(I - M(M^\top M)^{-1} M^\top) z](B_i) = 0$ and $[M(M^\top M)^{-1} t](B_i) = F(B_i)$, for
all $i \in \{1, \dots, m\}$. This implies that $[\frac{1}{\lambda}(z - Mv)](A_i) = t_i$, and thus $[\frac{1}{\lambda}(z - Mv)](B_i) = F(B_i)$.

Thus, if (a) is satisfied, then (b) is always satisfied. Thus to check if a certain lattice leads to the
optimal solution, we simply have to check that $\frac{1}{\lambda}(I - M(M^\top M)^{-1} M^\top) z + M(M^\top M)^{-1} t \in B(F)$. ■

We now turn to the proof of Proposition 4.

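Proof We show that when $\lambda$ increases, we move to a lattice which has to be merging some constant
sets. Let us assume that a lattice $\mathcal{D}$ is optimal for a certain $\mu$. Then, from Lemma 1, we have

$$ \frac{1}{\mu} \big( I - M(M^\top M)^{-1} M^\top \big) z + M(M^\top M)^{-1} t \in B(F). $$

Moreover, since from Prop. 3, $A_i$ is inseparable for $C_i \mapsto F(B_{i-1} \cup C_i) - F(B_{i-1})$, from the assumption
of the proposition, we obtain:

$$ \forall C_i \subset A_i, \quad [M(M^\top M)^{-1} t](C_i) = \frac{|C_i|}{|A_i|} \big( F(B_{i-1} \cup A_i) - F(B_{i-1}) \big) \leqslant F(B_{i-1} \cup C_i) - F(B_{i-1}), $$

which implies, for all $C \subset V$:

\begin{align*}
[M(M^\top M)^{-1} t](C) &= \sum_{i=1}^m [M(M^\top M)^{-1} t](C \cap A_i) \quad \text{by modularity}, \\
&\leqslant \sum_{i=1}^m \big( F(B_{i-1} \cup (C \cap A_i)) - F(B_{i-1}) \big) \quad \text{from above}, \\
&\leqslant \sum_{i=1}^m \big( F((B_{i-1} \cap C) \cup (C \cap A_i)) - F(B_{i-1} \cap C) \big) \quad \text{by submodularity}, \\
&= \sum_{i=1}^m \big( F(B_i \cap C) - F(B_{i-1} \cap C) \big) = F(C).
\end{align*}

Thus, for any set $C$, we have for $\lambda \geqslant \mu$ (which implies $\mu / \lambda \in [0, 1]$),

\begin{align*}
\Big[ \frac{1}{\lambda} \big( I - M(M^\top M)^{-1} M^\top \big) z + M(M^\top M)^{-1} t \Big](C)
&= \frac{\mu}{\lambda} \Big[ \frac{1}{\mu} \big( I - M(M^\top M)^{-1} M^\top \big) z + M(M^\top M)^{-1} t \Big](C) + \Big( 1 - \frac{\mu}{\lambda} \Big) [M(M^\top M)^{-1} t](C) \\
&\leqslant \frac{\mu}{\lambda} F(C) + \Big( 1 - \frac{\mu}{\lambda} \Big) F(C) = F(C).
\end{align*}

Thus the second condition in Lemma 1 is satisfied, so it has to be the first one which is violated,
leading to merging two constant sets. ■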



We now show that for special cases, the condition in Eq. (2) is satisfied, and we also show when the
condition in Eq. (3) of Theorem 1 is satisfied or not:



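• Cardinality-based functions: the condition in Eq. (2) is equivalent to

$$ \frac{h(|B| + |A|) - h(|B|)}{|A|} \leqslant \frac{h(|B| + |C|) - h(|B|)}{|C|}, $$

which is a consequence of the concavity of $h$. Moreover the condition in Eq. (3) is equivalent
to

$$ h(|B| + |C|) - h(|B|) - \frac{|C|}{|A|} \big[ h(|B| + |A|) - h(|B|) \big] \geqslant \eta \min \Big\{ \frac{|C|}{|A|}, 1 - \frac{|C|}{|A|} \Big\}. $$

For $h(t) = t(p - t)$, this is equivalent to

$$ |C| (|A| - |C|) \geqslant \eta \min \Big\{ \frac{|C|}{|A|}, 1 - \frac{|C|}{|A|} \Big\}, $$

which is true as soon as $\eta \leqslant |A|/2$.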

• One-dimensional total variation: we assume that we have a chain graph. Note that $A$
must be an interval and that $B$ only enters the problem if one of its elements is a neighbor
of one of the two extreme elements of $A$. We thus have nine cases, depending on the three
possibilities for each of the two neighbors of $A$ (in $B$, in $V \setminus B$, or no neighbor, i.e., end of the chain).
We consider all nine cases, where $C$ is a non-trivial subset of $A$, and compute a lower bound on
$F(B \cup C) - F(B) - \frac{|C|}{|A|} [F(B \cup A) - F(B)]$.

- left: $B$, right: $B$. $F(B) = 2$, $F(B \cup A) = 0$, $F(C \cup B) \geqslant 2$. Bound $= 2 \frac{|C|}{|A|}$.

- left: $B$, right: $V \setminus B$. $F(B) = 1$, $F(B \cup A) = 1$, $F(C \cup B) \geqslant 1$. Bound $= 0$.

- left: $B$, right: none. $F(B) = 1$, $F(B \cup A) = 0$, $F(C \cup B) \geqslant 1$. Bound $= \frac{|C|}{|A|}$.

- left: $V \setminus B$, right: $B$. $F(B) = 1$, $F(B \cup A) = 1$, $F(C \cup B) \geqslant 1$. Bound $= 0$.

- left: $V \setminus B$, right: $V \setminus B$. $F(B) = 0$, $F(B \cup A) = 2$, $F(C \cup B) \geqslant 2$. Bound $= 2 - 2 \frac{|C|}{|A|}$.

- left: $V \setminus B$, right: none. $F(B) = 0$, $F(B \cup A) = 1$, $F(C \cup B) \geqslant 1$. Bound $= 1 - \frac{|C|}{|A|}$.

- left: none, right: $B$. $F(B) = 1$, $F(B \cup A) = 0$, $F(C \cup B) \geqslant 1$. Bound $= \frac{|C|}{|A|}$.

- left: none, right: none. $F(B) = 0$, $F(B \cup A) = 0$, $F(C \cup B) \geqslant 1$. Bound $= 1$.

- left: none, right: $V \setminus B$. $F(B) = 0$, $F(B \cup A) = 1$, $F(C \cup B) \geqslant 1$. Bound $= 1 - \frac{|C|}{|A|}$.

Considering all cases, we get a lower bound of zero, which shows that the paths are agglomerative.
However, there are two cases where no strictly positive lower bounds are possible, namely
when the two extremities of $A$ have respective neighbors in $B$ and $V \setminus B$. Given that $B$ is a set
of higher values for the parameters and $V \setminus (A \cup B)$ is a set of lower values, this is exactly a
staircase. When there is no such staircase, we get a lower bound of $\min\{|C|/|A|, 1 - |C|/|A|\}$,
hence $\eta = 1$.
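
The case analysis above can be cross-checked by brute force on a small chain. The sketch below is illustrative only: the chain length `p = 6`, the helper names (`chain_cut`, `subsets`, `gap`) and the way a staircase is detected (one neighbor of the interval $A$ in $B$, the other outside $B$) are assumptions made for this example. It also checks the identity $|C|(|A| - |C|)$ obtained for $h(t) = t(p - t)$.

```python
from fractions import Fraction
from itertools import combinations

def chain_cut(S, p):
    """Cut function of the chain 0-1-...-(p-1): number of edges leaving S."""
    S = set(S)
    return sum((i in S) != (i + 1 in S) for i in range(p - 1))

def subsets(items, min_size, max_size):
    for r in range(min_size, max_size + 1):
        yield from combinations(items, r)

def gap(F, A, B, C):
    """F(B u C) - F(B) - (|C|/|A|) * (F(B u A) - F(B)), in exact arithmetic."""
    A, B, C = set(A), set(B), set(C)
    return F(B | C) - F(B) - Fraction(len(C), len(A)) * (F(B | A) - F(B))

p = 6
F_cut = lambda S: chain_cut(S, p)
F_card = lambda S: len(S) * (p - len(S))        # cardinality-based, h(t) = t (p - t)

min_gap = None               # smallest gap over all configurations
min_slack = None             # smallest gap - min{|C|/|A|, 1 - |C|/|A|} without staircase
staircase_hits_zero = False
cardinality_identity = True

for lo in range(p):
    for hi in range(lo + 1, p):                 # interval A = {lo, ..., hi}, |A| >= 2
        A = range(lo, hi + 1)
        rest = [k for k in range(p) if k < lo or k > hi]
        for B in subsets(rest, 0, len(rest)):
            left_in_B, right_in_B = (lo - 1) in B, (hi + 1) in B
            left_out = lo > 0 and not left_in_B
            right_out = hi < p - 1 and not right_in_B
            staircase = (left_in_B and right_out) or (left_out and right_in_B)
            for C in subsets(A, 1, len(A) - 1):  # non-trivial subsets of A
                g = gap(F_cut, A, B, C)
                ratio = Fraction(len(C), len(A))
                min_gap = g if min_gap is None else min(min_gap, g)
                if staircase:
                    staircase_hits_zero = staircase_hits_zero or g == 0
                else:
                    slack = g - min(ratio, 1 - ratio)
                    min_slack = slack if min_slack is None else min(min_slack, slack)
                if gap(F_card, A, B, C) != len(C) * (len(A) - len(C)):
                    cardinality_identity = False
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the zero lower bound in the staircase cases is detected without floating-point tolerance.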

E Proof of Proposition 5



Proof We denote by $w$ the unique minimizer of $\frac{1}{2} \| w - z \|_2^2 + f(w)$ and $s$ the associated dual vector
in $B(F)$. The optimality conditions are $w - z + s = 0$ and $f(w) = w^\top s$ (again from optimality
conditions for pointwise maxima).

We assume that $w$ takes distinct values $v_1, \dots, v_m$ on the sets $A_1, \dots, A_m$. We define $t$ as
$t_k = \mathrm{sign}(w_k) (|w_k| - \lambda)_+$ (which is the unique minimizer of $\frac{1}{2} \| w - t \|_2^2 + \lambda \| t \|_1$). The constant sets of $t$ are
$A_j$, for $j$ such that $|v_j| > \lambda$, and zero for the union of all $A_j$'s such that $|v_j| \leqslant \lambda$. Since $t$ is obtained
by soft-thresholding $w$, which corresponds to the $\ell_1$-proximal problem, we have that $t - w + \lambda q = 0$ with
$\| q \|_\infty \leqslant 1$ and $q^\top t = \| t \|_1$.

By combining these two equalities, we have $t - z + s + \lambda q = 0$ with $\| q \|_\infty \leqslant 1$, $q^\top t = \| t \|_1$
and $s \in B(F)$. The only remaining element to show that $t$ is optimal for the full problem is that
$f(t) = s^\top t$. This is true since the level sets of $w$ are finer than the ones of $t$ (i.e., $t$ is obtained by
grouping some values of $w$), with no change of ordering. ■
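
The soft-thresholding step used in this proof is easy to illustrate. The sketch below is a minimal example with an arbitrary random vector standing in for $w$ (an assumption; it is not the output of an actual proximal computation for $f$). It builds $t$, recovers $q$ from $t - w + \lambda q = 0$, and evaluates the two properties of $q$ together with the preservation of ordering. The names are hypothetical.

```python
import random

def soft_threshold(w, lam):
    """t_k = sign(w_k) * max(|w_k| - lam, 0)."""
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - lam, 0.0) for x in w]

lam = 0.5                                   # assumed regularization parameter
rng = random.Random(2)
w = [rng.uniform(-2.0, 2.0) for _ in range(8)]
t = soft_threshold(w, lam)

# q is defined through t - w + lam * q = 0.
q = [(wk - tk) / lam for wk, tk in zip(w, t)]

max_abs_q = max(abs(x) for x in q)                  # should not exceed 1
q_dot_t = sum(qk * tk for qk, tk in zip(q, t))      # should equal the l1-norm of t
l1_norm_t = sum(abs(x) for x in t)

# Soft-thresholding never reverses the ordering of two components.
order_preserved = all((w[i] - w[j]) * (t[i] - t[j]) >= 0
                      for i in range(len(w)) for j in range(len(w)))
```

Components of $w$ with $|w_k| \leqslant \lambda$ are mapped to zero, which is the grouping of level sets referred to at the end of the proof.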



F Proof of Proposition 6

Proof From Lemma 1, the solution has to correspond to a lattice $\mathcal{D}$ and we only have to show
that with probability one, the vector $v = (M^\top M)^{-1}(M^\top z - \lambda t)$ has distinct components, which is
straightforward because it has an absolutely continuous density with respect to the Lebesgue
measure. ■



G Proof of Theorem 1

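Proof From Lemma 1, in order to correspond to the same lattice $\mathcal{D}$, we simply need that (a)
$v = (M^\top M)^{-1}(M^\top z - \lambda t)$ satisfies the order relationships imposed by $\mathcal{D}$ and that (b)

$$ \frac{1}{\lambda} \big( I - M(M^\top M)^{-1} M^\top \big) z + M(M^\top M)^{-1} t \in B(F). $$

Condition (a) is satisfied as soon as $\| v - v^\ast \|_\infty \leqslant \nu / 2$, which is implied by

$$ \sigma \| (M^\top M)^{-1} M^\top \varepsilon \|_\infty \leqslant \nu / 4 \quad \text{and} \quad \| \lambda (M^\top M)^{-1} t \|_\infty \leqslant \nu / 4. \qquad (6) $$

The second condition in Eq. (6) is met by assumption, while the first one leads to the sufficient
conditions $\forall j$, $|\varepsilon(A_j)| \leqslant \nu |A_j| / (4 \sigma)$, leading by the union bound to the probabilities
$\sum_{j=1}^m \exp\big( - \frac{\nu^2 |A_j|}{32 \sigma^2} \big)$.

Following the same reasoning as in the proof of Prop. 4, condition (b) is satisfied as soon as for
all $j \in \{1, \dots, m\}$, and all $C_j \subset A_j$,

$$ \frac{\sigma}{\lambda} \big[ (I - M(M^\top M)^{-1} M^\top) \varepsilon \big](C_j) \leqslant \eta_j \min \Big\{ \frac{|C_j|}{|A_j|}, 1 - \frac{|C_j|}{|A_j|} \Big\}. $$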
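
Indeed, this implies that for all $j$,

\begin{align*}
\Big[ \frac{1}{\lambda} \big( I - M(M^\top M)^{-1} M^\top \big) z + M(M^\top M)^{-1} t \Big](C_j)
&= \Big[ \frac{\sigma}{\lambda} \big( I - M(M^\top M)^{-1} M^\top \big) \varepsilon + M(M^\top M)^{-1} t \Big](C_j) \\
&\leqslant \eta_j \min \Big\{ \frac{|C_j|}{|A_j|}, 1 - \frac{|C_j|}{|A_j|} \Big\} + \frac{|C_j|}{|A_j|} \big( F(B_{j-1} \cup A_j) - F(B_{j-1}) \big) \\
&\leqslant F(B_{j-1} \cup C_j) - F(B_{j-1}),
\end{align*}

which leads to $\frac{1}{\lambda}(I - M(M^\top M)^{-1} M^\top) z + M(M^\top M)^{-1} t \in B(F)$ using the sequence of inequalities
used in the proof of Prop. 4.

From Lemma 2 below, we thus get the probability $2 \sum_{j=1}^m |A_j| \exp\big( - \frac{\lambda^2 \eta_j^2}{2 \sigma^2 |A_j|^2} \big)$.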



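Lemma 2 For $F(A) = \min \{ \frac{|A|}{p}, 1 - \frac{|A|}{p} \}$, and $s$ normal with mean zero and variance $I - \frac{1}{p} 1_V 1_V^\top$, we
have:

$$ P \Big( \max_{A \subset V, A \neq \emptyset, A \neq V} \frac{s(A)}{F(A)} \geqslant t \Big) \leqslant 2p \exp \Big( - \frac{t^2}{2p^2} \Big). $$

Proof Since $F$ depends uniquely on the cardinality $|A|$ and is symmetric, we have, with $\tilde{s} \in \mathbb{R}^p$
the sorted (in descending order) components of $s$, and $h(a) = \min\{a/p, 1 - a/p\}$:

\begin{align*}
P \Big( \max_{A \subset V, A \neq \emptyset, A \neq V} \frac{s(A)}{F(A)} \geqslant t \Big)
&= P \Big( \max_{k \in \{1,\dots,p-1\}} \frac{\tilde{s}(\{1,\dots,k\})}{h(k)} \geqslant t \Big) \\
&\leqslant P \Big( \max_{k \in \{1,\dots,\lfloor p/2 \rfloor - 1\}} \frac{\tilde{s}(\{1,\dots,k\})}{h(k)} \geqslant t \Big) + P \Big( \max_{k \in \{\lfloor p/2 \rfloor,\dots,p-1\}} \frac{\tilde{s}(\{1,\dots,k\})}{h(k)} \geqslant t \Big) \\
&\leqslant 2 P \Big( \max_{k \in \{1,\dots,\lfloor p/2 \rfloor - 1\}} \frac{\tilde{s}(\{1,\dots,k\})}{k/p} \geqslant t \Big) \quad \text{because of symmetry due to the covariance of } s \\
&\leqslant 2 P \Big( \max_{k \in \{1,\dots,\lfloor p/2 \rfloor - 1\}} \frac{1}{k} \sum_{i=1}^k \tilde{s}_i \geqslant t/p \Big) \\
&\leqslant 2 P \Big( \max_{k \in \{1,\dots,p\}} s_k \geqslant t/p \Big) \\
&\leqslant 2p \exp \Big( - \frac{t^2}{2p^2} \Big).
\end{align*}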
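
The first step of this proof replaces a maximum over all non-trivial subsets by a maximum over prefixes of the sorted vector. The sketch below checks that reduction by brute force on a small example. The vector `s` is a random draw centred by subtracting its mean, which is an assumption about the intended covariance, and the function names are hypothetical.

```python
import random
from itertools import combinations

def h(k, p):
    """h(k) = min{k/p, 1 - k/p}, the value of F on any set of cardinality k."""
    return min(k / p, 1 - k / p)

def max_ratio_bruteforce(s):
    """Maximum of s(A) / F(A) over all non-empty proper subsets A."""
    p = len(s)
    best = float("-inf")
    for k in range(1, p):
        for A in combinations(range(p), k):
            best = max(best, sum(s[i] for i in A) / h(k, p))
    return best

def max_ratio_sorted(s):
    """Same maximum, using only the k largest components for each cardinality k."""
    p = len(s)
    ordered = sorted(s, reverse=True)
    best, prefix = float("-inf"), 0.0
    for k in range(1, p):
        prefix += ordered[k - 1]
        best = max(best, prefix / h(k, p))
    return best

rng = random.Random(3)
p = 8                                    # assumed small example size
raw = [rng.gauss(0.0, 1.0) for _ in range(p)]
mean = sum(raw) / p
s = [x - mean for x in raw]              # centred Gaussian vector (assumed example)

brute = max_ratio_bruteforce(s)
fast = max_ratio_sorted(s)
```

The reduction works because $F(A)$ is positive and depends only on $|A|$, so for each cardinality the ratio is maximized by the set of largest components.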



We now consider the three special cases:

• One-dimensional total variation: without the staircase effect, as shown in Appendix D,
we have $\eta_j = 1$. Moreover, $|F(B_j) - F(B_{j-1})| \leqslant 2$, and thus Eq. (5) leads to $\lambda \leqslant \frac{\nu}{8} \min_j |A_j|$.

Figure 6: Signal approximation with the two-dimensional total variation: For two piecewise constant
images with two values, the estimation may (left case) or may not (right case) recover the correct 
level sets, even with infinitesimal noise. For the two cases, left: original pattern, right: best possible 
recovered level sets. 

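Using the largest possible $\lambda$ in Eq. (5), we obtain a probability greater than

\begin{align*}
& 1 - \sum_{j=1}^m \exp \Big( - \frac{\nu^2 |A_j|}{32 \sigma^2} \Big) - 2 \sum_{j=1}^m |A_j| \exp \Big( - \frac{\nu^2 \min_k |A_k|^2}{128 \sigma^2 |A_j|^2} \Big) \\
&\geqslant 1 - \sum_{j=1}^m \exp \Big( - \frac{\nu^2 |A_j|}{32 \sigma^2} \Big) - 2p \exp \Big( - \frac{\nu^2 \min_j |A_j|^2}{128 \sigma^2 \max_j |A_j|^2} \Big) \\
&\geqslant 1 - 4p \exp \Big( - \frac{\nu^2 \min_j |A_j|^2}{128 \sigma^2 \max_j |A_j|^2} \Big),
\end{align*}

because the second term is always greater than the third one.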
• Two-dimensional total variation: we simply build the following counter-example:

where $B$ are the black nodes, $C$ the gray nodes and $A$ the complement of $B$. We indeed
have $A$ connected, and $F(B \cup C) - F(B) = 4 - 5 = -1$, $F(B \cup A) - F(B) = -5$, leading to
$F(B \cup C) - F(B) - \frac{|C|}{|A|} [F(B \cup A) - F(B)] = -1 + 5 \times \frac{|C|}{|A|}$.

We also illustrate this in Figure 6, where we show that depending on the shape of the level
sets (which still have to be connected), we may not recover the correct pattern, even with very
small noise.

[Figure: diagram for $p = 3$, with regions labelled by the orderings $w_3 > w_1 > w_2$, $w_1 > w_3 > w_2$,
$w_1 > w_2 > w_3$, $w_3 > w_2 > w_1$, $w_2 > w_3 > w_1$ and $w_2 > w_1 > w_3$, boundaries such as $w_1 = w_2$ and
$w_2 = w_3$, and points $(1,0,1)/F(\{1,3\})$, $(1,0,0)/F(\{1\})$, $(0,0,1)/F(\{3\})$ and $(0,1,1)/F(\{2,3\})$.]